03 Automated Machine Learning with H2O (I)

Last compiled: 2021-01-05

The two parts within this section and the next section (Performance Measures) are arranged according to the CRISP-DM process model. CRISP-DM breaks the process of data analysis into six major phases:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Business Understanding [1] and Data Understanding [2] are covered in the first part of this section.

Data Preparation [3] and Modeling [4] are covered in the second part of this section, which introduces the use of automated machine learning with H2O. Evaluation [5] is covered extensively in the next section (Performance Measures). Deployment [6] is not covered in this journal.

Please note, I will be using an IBM Employee Attrition data set for [1-2] and a Product Backorders data set for [3-5].

Problem

As many companies know, attrition impacts all businesses, irrespective of geography, industry, and company size. Employee attrition leads to significant costs for a business. Thus, there is great business interest in understanding the drivers of staff attrition and minimizing it.

The following data set presents an employee survey from IBM, indicating whether or not attrition occurred. The data set contains 1470 entries.

While some level of attrition in a company is inevitable, minimizing it, and being prepared for the cases that cannot be helped, significantly improves the operations of most businesses.

True Cost of Employee Attrition

Tying a financial figure to a business problem is essential to analyzing it. An Excel Employee Turnover Cost Calculator (provided below) communicates the size of the problem in financial terms and is a great way to show others in an organization the true cost of losing good employees. It is simple to use, and since most business professionals have Excel, it is easy to review an organization's cost of turnover with them. As shown below, an organization that loses 200 productive employees per year could have a hidden cost of $15M/year in lost productivity. On top of that, most organizations don't realize it, because lost productivity is a hidden cost.

Employee Turnover Cost Calculator
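The same back-of-the-envelope calculation can be sketched in R. The inputs below (average salary and the salary-equivalent productivity multiplier) are illustrative assumptions, not figures taken from the calculator itself:

```r
# Illustrative turnover cost estimate (all inputs are assumptions)
n_employees_lost        <- 200     # productive employees lost per year
avg_salary              <- 80000   # assumed average annual salary
productivity_multiplier <- 1.0     # assumed lost productivity per departure,
                                   # expressed as a share of annual salary

estimated_cost <- n_employees_lost * avg_salary * productivity_multiplier
format(estimated_cost, big.mark = ",")  # "16,000,000" -- on the order of $15M/year
```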

CRISP-DM Data Science Framework

The work performed in the Automated Machine Learning with H2O and Performance Measures sections follows a data analysis process called CRISP-DM. CRISP-DM stands for “Cross-Industry Standard Process for Data Mining”. It is an open standard process model that describes common approaches used by data mining experts.

Following CRISP-DM guidelines, we start with a Business Understanding. It is an astoundingly common mistake to start projects without first properly defining the problem and objectives. This mistake is not specific to data analysis but is common to all types of problem-solving activities. As a result, all major problem-solving methodologies, including 8-D, six-sigma DMAIC and, of course, CRISP-DM, place Problem Definition or Business Understanding first and stress its importance.

In the end, H2O is used to determine the probability that a given employee falls into the "attrition" condition and is therefore at high risk of leaving the company. Before we are able to do that, we need a profound understanding of the business and the data.
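Later in this journal, that probability comes from H2O's prediction interface. As a preview, here is a minimal sketch; the object names (leader_model, employees_h2o) are placeholders for objects created in the second part:

```r
# Sketch: scoring data with a trained H2O classification model
# (leader_model and employees_h2o are placeholder names)
# h2o.predict() returns an H2OFrame with the predicted class plus one
# probability column per class, e.g. "No" and "Yes" for attrition
predictions <- h2o.predict(leader_model, newdata = employees_h2o)
```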

CRISP-DM Framework (Cross-Industry Standard Process for Data Mining)

Goal

The goal is to develop an understanding of the problem from both a business perspective and a data perspective.

The work performed in this section is broken down into multiple steps as follows:
  1. Load libraries
  2. Load data and scripts
  3. Determine objective and assess situation
  4. Visualize the situation
  5. Explore and skim all data
  6. Perform data visualization
  7. Answer questions via feature exploration

As mentioned above, I will be working with an Employee Attrition data set (source of raw data is linked below). I have also provided scripts which are separate functions in order to process the data correctly for visualization. Finally, I have included a definitions table, which will be needed in the second part of this section.

You may download the data, scripts, and definitions in case you want to try this code on your own.

Raw data source:
Download datasets-1067-1925-WA_Fn-UseC_-HR-Employee-Attrition.csv

Script source:
Download assess_attrition.R
Download calculate_attrition_cost.R
Download count_to_pct.R
Download plot_attrition.R
Download plot_ggpairs.R

Definitions table source:
Download data_definitions.xlsx

Name Description
AGE Numerical Value
ATTRITION (0=NO, 1=YES) - Employee leaving the company
BUSINESS TRAVEL (1=NO TRAVEL, 2=TRAVEL FREQUENTLY, 3=TRAVEL RARELY)
DAILY RATE Numerical Value - Salary Level
DEPARTMENT (1=HR, 2=R&D, 3=SALES)
DISTANCE FROM HOME Numerical Value - Distance from work to home
EDUCATION Numerical Value
EDUCATION FIELD (1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL)
EMPLOYEE COUNT Numerical Value
EMPLOYEE NUMBER Numerical Value - Employee identification
ENVIRONMENT SATISFACTION Numerical Value - Satisfaction with the environment
GENDER (1=FEMALE, 2=MALE)
HOURLY RATE Numerical Value - Hourly salary
JOB INVOLVEMENT Numerical Value
JOB LEVEL Numerical Value
JOB ROLE (1=HC REP, 2=HR, 3=LAB TECHNICIAN, 4=MANAGER, 5=MANAGING DIRECTOR, 6=RESEARCH DIRECTOR, 7=RESEARCH SCIENTIST, 8=SALES EXECUTIVE, 9=SALES REPRESENTATIVE)
JOB SATISFACTION Numerical Value - Satisfaction with the job
MARITAL STATUS (1=DIVORCED, 2=MARRIED, 3=SINGLE)
MONTHLY INCOME Numerical Value - Monthly salary
MONTHLY RATE Numerical Value
NUMCOMPANIES WORKED Numerical Value - Number of companies worked at
OVER 18 (1=YES, 2=NO)
OVERTIME (1=NO, 2=YES)
PERCENT SALARY HIKE Numerical Value - Percentage increase in salary
PERFORMANCE RATING Numerical Value
RELATIONSHIP SATISFACTION Numerical Value
STANDARD HOURS Numerical Value
STOCK OPTIONS LEVEL Numerical Value
TOTAL WORKING YEARS Numerical Value - Total years employee has worked
TRAINING TIMES LAST YEAR Numerical Value - Hours spent training
WORK LIFE BALANCE Numerical Value - Time spent between work and outside
YEARS AT COMPANY Numerical Value - Total number of years at the company
YEARS IN CURRENT ROLE Numerical Value
YEARS SINCE LAST PROMOTION Numerical Value
YEARS WITH CURRENT MANAGER Numerical Value - Years spent with the current manager


Step 1: Load libraries

As a first step, please load the tidyverse, readxl, skimr, and GGally packages. For details on what these libraries offer, please refer to the comments in the code block below.

# STEP 1: Load Libraries ---
# Tidy, Transform, & Visualize
library(tidyverse)
#  library(tibble)    --> is a modern re-imagining of the data frame
#  library(readr)     --> provides a fast and friendly way to read rectangular data like csv
#  library(dplyr)     --> provides a grammar of data manipulation
#  library(magrittr)  --> offers a set of operators which make your code more readable (pipe operator)
#  library(tidyr)     --> provides a set of functions that help you get to tidy data
#  library(stringr)   --> provides a cohesive set of functions designed to make working with strings as easy as possible
#  library(ggplot2)   --> graphics

library(readxl)       # reading excel files
library(skimr)        # summary function that displays nicely in console
library(GGally)       # extends 'ggplot2' to reduce complexity of combining geometric objects with transformed data

If you haven’t installed these packages, please install them by calling install.packages("name_of_package") in the R console. After installing, run the above code block again.
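For example, all four packages used in this part can be installed in a single call (a one-time setup step):

```r
# One-time setup: install the packages used in this part
install.packages(c("tidyverse", "readxl", "skimr", "GGally"))
```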

Step 2: Load data and scripts

# Load data
employee_attrition_tbl <- read_csv("00_raw_data/datasets-1067-1925-WA_Fn-UseC_-HR-Employee-Attrition.csv")

path_data_definitions <- "00_raw_data/data_definitions.xlsx"
definitions_raw_tbl   <- read_excel(path_data_definitions, sheet = 1, col_names = FALSE)

# Function to convert counts to percentages
source("00_scripts/count_to_pct.R")
# Function to calculate attrition cost
source("00_scripts/calculate_attrition_cost.R")
# Function to assess attrition
source("00_scripts/assess_attrition.R")
# Function to plot attrition
source("00_scripts/plot_attrition.R")
# Function to plot ggpairs and explore features by category
source("00_scripts/plot_ggpairs.R")

Step 3: Determine objective and assess situation

# Data subset
dept_job_role_tbl <- employee_attrition_tbl %>%
  select(EmployeeNumber, Department, JobRole, PerformanceRating, Attrition)

# Investigate objectives: 16 % Attrition
# Analyze attrition from data
dept_job_role_tbl %>%
  
  group_by(Attrition) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n))
# Synthesize outcomes: High Counts and High percentages
# Hypothesize drivers: Job Role and Departments
# Attrition by department
dept_job_role_tbl %>%
  
  group_by(Department, Attrition) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  
  group_by(Department) %>%
  mutate(pct = n / sum(n))
# Attrition by job role
dept_job_role_tbl %>%
  
  group_by(Department, JobRole, Attrition) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  
  group_by(Department, JobRole) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  
  filter(Attrition %in% "Yes") %>%
  arrange(desc(pct))

Step 4: Visualize the situation

# Visualize
dept_job_role_tbl %>%
  
  # Count observations and convert counts to percentages
  count(Department, JobRole, Attrition) %>%
  count_to_pct(Department, JobRole) %>%
  
  assess_attrition(Attrition, attrition_value = "Yes", baseline_pct = 0.088) %>%
  mutate(
    cost_of_attrition = calculate_attrition_cost(n = n, salary = 80000)
  ) %>%
  
  # Plot the cost of attrition
  plot_attrition(Department, JobRole, .value = cost_of_attrition,
                 units = "M") +
  labs(title = "Estimated Cost of Attrition by Job Role",
       x = "Cost of Attrition",
       y = "Department | Job Role",
       subtitle = "Looks like Sales Executive and Laboratory Technician are the biggest drivers of cost"
  )

Step 5: Explore and skim all data

# Descriptive Features
employee_attrition_tbl %>% select(Age, DistanceFromHome, Gender, MaritalStatus, NumCompaniesWorked, Over18)

# Employment Features
employee_attrition_tbl %>% select(Department, EmployeeCount, EmployeeNumber, JobInvolvement, JobLevel, JobRole, JobSatisfaction)

# Compensation Features
employee_attrition_tbl %>% select(Attrition, DailyRate, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, StockOptionLevel)

# Survey Results
employee_attrition_tbl %>% select(EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction, WorkLifeBalance)

# Performance Data
employee_attrition_tbl %>% select(JobInvolvement, PerformanceRating)

# Work-Life Features
employee_attrition_tbl %>% select(BusinessTravel, OverTime)

# Training & Education
employee_attrition_tbl %>% select(Education, EducationField, TrainingTimesLastYear)

# Time-Based Features
employee_attrition_tbl %>% select(TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager)
# Data skimming
skim(employee_attrition_tbl)
Data summary
Name employee_attrition_tbl
Number of rows 1470
Number of columns 35
_______________________
Column type frequency:
character 9
numeric 26
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Attrition 0 1 2 3 0 2 0
BusinessTravel 0 1 10 17 0 3 0
Department 0 1 5 22 0 3 0
EducationField 0 1 5 16 0 6 0
Gender 0 1 4 6 0 2 0
JobRole 0 1 7 25 0 9 0
MaritalStatus 0 1 6 8 0 3 0
Over18 0 1 1 1 0 1 0
OverTime 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.92 9.14 18 30.00 36.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 802.49 403.51 102 465.00 802.0 1157.00 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.19 8.11 1 2.00 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.91 1.02 1 2.00 3.0 4.00 5 ▂▃▇▆▁
EmployeeCount 0 1 1.00 0.00 1 1.00 1.0 1.00 1 ▁▁▇▁▁
EmployeeNumber 0 1 1024.87 602.02 1 491.25 1020.5 1555.75 2068 ▇▇▇▇▇
EnvironmentSatisfaction 0 1 2.72 1.09 1 2.00 3.0 4.00 4 ▅▅▁▇▇
HourlyRate 0 1 65.89 20.33 30 48.00 66.0 83.75 100 ▇▇▇▇▇
JobInvolvement 0 1 2.73 0.71 1 2.00 3.0 3.00 4 ▁▃▁▇▁
JobLevel 0 1 2.06 1.11 1 1.00 2.0 3.00 5 ▇▇▃▂▁
JobSatisfaction 0 1 2.73 1.10 1 2.00 3.0 4.00 4 ▅▅▁▇▇
MonthlyIncome 0 1 6502.93 4707.96 1009 2911.00 4919.0 8379.00 19999 ▇▅▂▁▂
MonthlyRate 0 1 14313.10 7117.79 2094 8047.00 14235.5 20461.50 26999 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.69 2.50 0 1.00 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.21 3.66 11 12.00 14.0 18.00 25 ▇▅▃▂▁
PerformanceRating 0 1 3.15 0.36 3 3.00 3.0 3.00 4 ▇▁▁▁▂
RelationshipSatisfaction 0 1 2.71 1.08 1 2.00 3.0 4.00 4 ▅▅▁▇▇
StandardHours 0 1 80.00 0.00 80 80.00 80.0 80.00 80 ▁▁▇▁▁
StockOptionLevel 0 1 0.79 0.85 0 0.00 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.28 7.78 0 6.00 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.80 1.29 0 2.00 3.0 3.00 6 ▂▇▇▂▃
WorkLifeBalance 0 1 2.76 0.71 1 2.00 3.0 3.00 4 ▁▃▁▇▂
YearsAtCompany 0 1 7.01 6.13 0 3.00 5.0 9.00 40 ▇▂▁▁▁
YearsInCurrentRole 0 1 4.23 3.62 0 2.00 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.19 3.22 0 0.00 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.12 3.57 0 2.00 3.0 7.00 17 ▇▂▅▁▁

Step 6: Perform data visualization

# Visualize all data
employee_attrition_tbl %>%
  select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome) %>%
  ggpairs() 

# Create a data tibble to help debug the plot_ggpairs function (it takes a data argument)
data <- employee_attrition_tbl %>%
  select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome)

employee_attrition_tbl %>%
  select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome) %>%
  plot_ggpairs(color = Attrition)

Step 7: Answer questions via feature exploration

# Explore Features by Category

#   1. Descriptive features: Age, Gender, Marital Status 
employee_attrition_tbl %>%
  select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome) %>%
  plot_ggpairs(Attrition)

#   2. Employment features: Department, Job Role, Job Level
employee_attrition_tbl %>%
  select(Attrition, contains("employee"), contains("department"), contains("job")) %>%
  plot_ggpairs(Attrition) 

#   3. Compensation features: Hourly Rate, Monthly Income, Stock Option Level 
employee_attrition_tbl %>%
  select(Attrition, contains("income"), contains("rate"), contains("salary"), contains("stock")) %>%
  plot_ggpairs(Attrition)

#   4. Survey Results: Satisfaction Level, Work Life Balance 
employee_attrition_tbl %>%
  select(Attrition, contains("satisfaction"), contains("life")) %>%
  plot_ggpairs(Attrition)

#   5. Performance Data: Job Involvement, Performance Rating
employee_attrition_tbl %>%
  select(Attrition, contains("performance"), contains("involvement")) %>%
  plot_ggpairs(Attrition)

#   6. Work-Life Features 
employee_attrition_tbl %>%
  select(Attrition, contains("overtime"), contains("travel")) %>%
  plot_ggpairs(Attrition)

#   7. Training and Education 
employee_attrition_tbl %>%
  select(Attrition, contains("training"), contains("education")) %>%
  plot_ggpairs(Attrition)

#   8. Time-Based Features: Years At Company, Years In Current Role
employee_attrition_tbl %>%
  select(Attrition, contains("years")) %>%
  plot_ggpairs(Attrition)


Challenge Questions

  1. Compensation Features
     Question: What can I deduce about the interaction between Monthly Income and Attrition?
     Answer:   C. Those that are leaving have a lower Monthly Income

  2. Compensation Features
     Question: What can I deduce about the interaction between Percent Salary Hike and Attrition?
     Answer:   D. It’s difficult to deduce anything based on the visualization

  3. Compensation Features
     Question: What can I deduce about the interaction between Stock Option Level and Attrition?
     Answer:   C. It’s difficult to deduce anything based on the visualization

  4. Survey Results
     Question: What can I deduce about the interaction between Environment Satisfaction and Attrition?
     Answer:   A. A higher proportion of those leaving have a low environment satisfaction level

  5. Survey Results
     Question: What can I deduce about the interaction between Work Life Balance and Attrition?
     Answer:   B. Those that are staying have a higher density of 2’s and 3’s

  6. Performance Data
     Question: What can I deduce about the interaction between Job Involvement and Attrition?
     Answer:   A. Those that are leaving have a lower density of 3’s and 4’s

  7. Work-Life Features
     Question: What can I deduce about the interaction between Over Time and Attrition?
     Answer:   B. The proportion of those staying that are working Over Time is high compared to those that are not staying

  8. Training and Education
     Question: What can I deduce about the interaction between Training Times Last Year and Attrition?
     Answer:   B. People that leave tend to have less annual training

  9. Time-Based Features
     Question: What can I deduce about the interaction between Years At Company and Attrition?
     Answer:   B. People that leave tend to have fewer working years at the company

  10. Time-Based Features
      Question: What can I deduce about the interaction between Years Since Last Promotion and Attrition?
      Answer:   C. It’s difficult to deduce anything based on the visualization


03 Automated Machine Learning with H2O (II)

Last compiled: 2021-01-05

Goal

The goal is to predict whether or not a product will be put on ‘backorder’ status, given a number of product metrics such as current inventory, transit time, demand forecasts, and prior sales. This is a classic binary classification problem, and I will use Automated Machine Learning with H2O to tackle it.

The work performed here is broken down into multiple steps as follows:
  1. Load libraries
  2. Load training & test data sets
  3. Specify response and predictor variables
  4. Perform H2O AutoML, specifying the stopping criterion
  5. View the Leaderboard
  6. Save the leader model
  7. Predict using leader model

For this, I will be working with a Product Backorders data set (source of raw data is linked below). You may download the data in case you want to try this code on your own.

Raw data source:
Download product_backorders.csv

Step 1: Load libraries

As a first step, please load the tidyverse, tidymodels, and h2o packages. For details on what these libraries offer, please refer to the comments in the code block below.

# STEP 1: Load Libraries ---
# Tidy, Transform, & Visualize
library(tidyverse)
#  library(tibble)    --> is a modern re-imagining of the data frame
#  library(readr)     --> provides a fast and friendly way to read rectangular data like csv
#  library(dplyr)     --> provides a grammar of data manipulation
#  library(magrittr)  --> offers a set of operators which make your code more readable (pipe operator)
#  library(tidyr)     --> provides a set of functions that help you get to tidy data
#  library(stringr)   --> provides a cohesive set of functions designed to make working with strings as easy as possible
#  library(ggplot2)   --> graphics

library(tidymodels)
# library(rsample)    --> provides infrastructure for efficient data splitting, resampling and cross validation.
# library(parsnip)    --> provides an API to many powerful modeling algorithms in R.
# library(recipes)    --> tidy interface to data pre-processing (making statistical transformations) tools for feature engineering (prior to modeling).
# library(workflows)  --> bundle your pre-processing, modeling, and post-processing together.
# library(tune)       --> helps you optimize the hyperparameters of your model and pre-processing steps.
# library(yardstick)  --> measures the effectiveness of models using performance metrics (metrics for model comparison).
# library(broom)      --> converts the information in common statistical R objects into user-friendly, predictable formats.
# library(dials)      --> creates and manages tuning parameters and parameter grids.

library(h2o)          # H2O modeling

If you haven’t installed these packages, please install them by calling install.packages("name_of_package") in the R console. After installing, run the above code block again.

Step 2: Load training & test data sets

product_backorders_tbl          <- read_csv("00_raw_data/product_backorders.csv")

set.seed(seed = 1113)
split_obj                       <- rsample::initial_split(product_backorders_tbl, prop = 0.75)
train_readable_tbl              <- training(split_obj)
test_readable_tbl               <- testing(split_obj)

# We need to convert those columns to factors in the next step
factor_names <- c("potential_issue", "deck_risk", "oe_constraint", 
                  "ppap_risk", "stop_auto_buy", "rev_stop")

Step 3: Specify response and predictor variables

# Create recipe
recipe_obj <- recipe(went_on_backorder ~ ., data = train_readable_tbl) %>% 
  step_zv(all_predictors()) %>% 
  step_mutate_at(factor_names, fn = as.factor) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_dummy(all_nominal(),-all_outcomes()) %>% 
  prep()

# To finalize this process, bake the train & test data
train_tbl <- bake(recipe_obj, new_data = train_readable_tbl)
train_tbl %>% glimpse()
## Rows: 14,290
## Columns: 23
## $ sku                 <dbl> -1.428355, -1.428134, -1.427221, -1.426697, -1.425791, -1.424842, -1.423346, -1.422762, -1.422…
## $ national_inv        <dbl> -0.05080693, -0.05080693, -0.04808514, -0.05080693, -0.05080693, -0.04332202, -0.05026257, -0.…
## $ lead_time           <dbl> 0.03607535, 0.03607535, -0.84681630, 0.03607535, 0.62466979, 0.03607535, 0.18322396, 0.0360753…
## $ in_transit_qty      <dbl> -0.02825479, -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.…
## $ forecast_3_month    <dbl> -0.033695988, -0.034645287, -0.024440325, -0.032984014, -0.027762871, 0.016142198, -0.02491497…
## $ forecast_6_month    <dbl> -0.033009354, -0.033364205, -0.022009001, -0.032063087, -0.030052270, 0.008862959, -0.02579406…
## $ forecast_9_month    <dbl> -0.033186224, -0.033348357, -0.021269486, -0.031970230, -0.031159568, 0.006211972, -0.02435000…
## $ sales_1_month       <dbl> -0.031049401, -0.030432083, -0.021172316, -0.027962812, -0.026728176, -0.012529867, -0.0279628…
## $ sales_3_month       <dbl> -0.030649554, -0.031073243, -0.022599460, -0.027895574, -0.028319264, -0.008617716, -0.0268363…
## $ sales_6_month       <dbl> -0.029670305, -0.030286209, -0.022382100, -0.026898734, -0.027206686, -0.002365199, -0.0264881…
## $ sales_9_month       <dbl> -0.029467911, -0.030065430, -0.022895208, -0.027409792, -0.027144228, -0.007691681, -0.0264139…
## $ min_bank            <dbl> -0.052350594, -0.052350594, -0.039256119, -0.052350594, -0.049731699, 0.014431230, -0.04711280…
## $ pieces_past_due     <dbl> -0.01006667, -0.03784284, -0.03784284, -0.03784284, 0.04548566, -0.03784284, -0.03784284, -0.0…
## $ perf_6_month_avg    <dbl> 0.2831149, 0.2854431, 0.2796226, 0.2854431, 0.2862192, 0.2481916, 0.2749661, 0.2869953, 0.2858…
## $ perf_12_month_avg   <dbl> 0.2734724, 0.2766722, 0.2730724, 0.2770722, 0.2770722, 0.2378747, 0.2650729, 0.2758723, 0.2746…
## $ local_bo_qty        <dbl> -0.04204058, -0.04204058, -0.04204058, -0.04204058, -0.04204058, -0.04204058, -0.04204058, -0.…
## $ went_on_backorder   <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ potential_issue_Yes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ deck_risk_Yes       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ oe_constraint_Yes   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ppap_risk_Yes       <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ stop_auto_buy_Yes   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rev_stop_Yes        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
test_tbl <- bake(recipe_obj, new_data = test_readable_tbl)
test_tbl %>% glimpse()
## Rows: 4,763
## Columns: 23
## $ sku                 <dbl> -1.424591, -1.416684, -1.411482, -1.406362, -1.404739, -1.402797, -1.402192, -1.401664, -1.400…
## $ national_inv        <dbl> -0.05543396, -0.04985430, -0.05080693, -0.03039353, -0.05080693, -0.04481900, -0.05039866, -0.…
## $ lead_time           <dbl> 0.03607535, 0.03607535, 0.62466979, 0.03607535, 0.03607535, -1.14111352, -0.69966769, 0.036075…
## $ in_transit_qty      <dbl> -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.02825479, -0.02887894, -0.…
## $ forecast_3_month    <dbl> -0.006640973, -0.035119937, 0.003563989, 0.042010590, -0.034407962, 0.006886535, -0.031322741,…
## $ forecast_6_month    <dbl> -0.0053310456, -0.0337190547, -0.0104172306, 0.0402080523, -0.0328910711, 0.0008196897, -0.030…
## $ forecast_9_month    <dbl> -0.0142167223, -0.0336726217, -0.0149463185, 0.0656335315, -0.0323755617, -0.0004354602, -0.03…
## $ sales_1_month       <dbl> 0.020187977, -0.030432083, -0.031049401, 0.089944889, -0.027962812, 0.009076256, -0.027345494,…
## $ sales_3_month       <dbl> -5.651892e-03, -3.128509e-02, -2.048101e-02, 6.129100e-02, -2.831926e-02, 7.034456e-04, -2.874…
## $ sales_6_month       <dbl> -0.015812450, -0.030388860, -0.019097275, 0.066821423, -0.028027892, -0.004931468, -0.02813054…
## $ sales_9_month       <dbl> -0.019310097, -0.030065430, -0.019509270, 0.052259342, -0.028206483, -0.009749801, -0.02800731…
## $ min_bank            <dbl> -0.009138826, -0.052350594, -0.052350594, -0.052350594, -0.051041147, 0.028835153, -0.05235059…
## $ pieces_past_due     <dbl> -0.03784284, -0.03784284, -0.03784284, -0.03784284, -0.03784284, -0.03784284, -0.03784284, -0.…
## $ perf_6_month_avg    <dbl> 0.2869953, 0.2869953, 0.2765183, 0.2866072, 0.2862192, 0.2481916, 0.2827269, 0.2769063, 0.2866…
## $ perf_12_month_avg   <dbl> 0.2766722, 0.2770722, 0.2654729, 0.2750723, 0.2770722, 0.2378747, 0.2738724, 0.2694727, 0.2774…
## $ local_bo_qty        <dbl> 1.3983256987, -0.0420405758, -0.0420405758, -0.0420405758, -0.0420405758, -0.0420405758, -0.04…
## $ went_on_backorder   <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ potential_issue_Yes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ deck_risk_Yes       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ oe_constraint_Yes   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ppap_risk_Yes       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ stop_auto_buy_Yes   <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rev_stop_Yes        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Step 4: Perform H2O AutoML, specifying the stopping criterion

# H2O modeling cluster initialization
h2o.init()

# Split data into a training and a validation data frame
# Setting the seed is just for reproducibility
split_h2o <- h2o.splitFrame(as.h2o(train_tbl), ratios = c(0.75), seed = 1435)
train_h2o <- split_h2o[[1]]
valid_h2o <- split_h2o[[2]]
test_h2o  <- as.h2o(test_tbl)

# Set the target and predictors
y <- "went_on_backorder"
x <- setdiff(names(train_h2o), y)

# Run AutoML
automl_models_h2o <- h2o.automl(x = x,
                                y = y,
                                training_frame    = train_h2o,
                                validation_frame  = valid_h2o,
                                leaderboard_frame = test_h2o,
                                max_runtime_secs  = 30,
                                nfolds            = 5)
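Here max_runtime_secs = 30 caps the total search time. h2o.automl() also accepts other stopping controls, such as a cap on the number of models or metric-based early stopping. A sketch with illustrative parameter values:

```r
# Alternative stopping criteria for h2o.automl() (values are illustrative)
automl_models_h2o <- h2o.automl(x = x,
                                y = y,
                                training_frame    = train_h2o,
                                validation_frame  = valid_h2o,
                                leaderboard_frame = test_h2o,
                                max_models        = 10,    # stop after 10 base models
                                stopping_metric   = "AUC", # early-stop each model on AUC...
                                stopping_rounds   = 3,     # ...after 3 scoring events without improvement
                                nfolds            = 5,
                                seed              = 1113)
```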

Step 5: View the Leaderboard

# Inspect the leaderboard
typeof(automl_models_h2o)
## [1] "S4"
slotNames(automl_models_h2o)
## [1] "project_name"   "leader"         "leaderboard"    "event_log"      "modeling_steps" "training_info"
automl_models_h2o@leaderboard 
##                                              model_id       auc   logloss     aucpr mean_per_class_error      rmse        mse
## 1    StackedEnsemble_AllModels_AutoML_20210105_210409 0.9452838 0.1882320 0.7172445            0.1608892 0.2300236 0.05291086
## 2 StackedEnsemble_BestOfFamily_AutoML_20210105_210409 0.9452067 0.1877056 0.7227806            0.1616021 0.2297830 0.05280024
## 3      XGBoost_grid__1_AutoML_20210105_210409_model_1 0.9412864 0.1793778 0.7223598            0.1705118 0.2282421 0.05209444
## 4                    XGBoost_3_AutoML_20210105_210409 0.9325234 0.2008155 0.7195051            0.1665624 0.2351780 0.05530871
## 5                    XGBoost_2_AutoML_20210105_210409 0.9274548 0.2340078 0.6933041            0.1699284 0.2469269 0.06097290
## 6          GBM_grid__1_AutoML_20210105_210409_model_1 0.9264239 0.2168210 0.6431307            0.1767898 0.2522805 0.06364543
## 
## [16 rows x 7 columns]
automl_models_h2o@leader
# Extract leader model from the leaderboard
# automl_models_h2o@leader gives you the same output
h2o.getModel("StackedEnsemble_AllModels_AutoML_20210105_210409")
## Model Details:
## ==============
## 
## H2OBinomialModel: stackedensemble
## Model ID:  StackedEnsemble_AllModels_AutoML_20210105_210409 
## Number of Base Models: 14
## 
## Base Models (count by algorithm type):
## 
## deeplearning          drf          gbm          glm      xgboost 
##            1            2            6            1            4 
## 
## Metalearner:
## 
## Metalearner algorithm: glm
## Metalearner cross-validation fold assignment:
##   Fold assignment scheme: AUTO
##   Number of folds: 5
##   Fold column: NULL
## Metalearner hyperparameters: 
## 
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
## 
## MSE:  0.04141113
## RMSE:  0.2034973
## LogLoss:  0.1479903
## Mean Per-Class Error:  0.1289896
## AUC:  0.9719059
## AUCPR:  0.8414303
## Gini:  0.9438118
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          No  Yes    Error       Rate
## No     8509  290 0.032958  =290/8799
## Yes     268  923 0.225021  =268/1191
## Totals 8777 1213 0.055856  =558/9990
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.339212    0.767887 207
## 2                       max f2  0.118825    0.829099 284
## 3                 max f0point5  0.715452    0.793560 106
## 4                 max accuracy  0.367397    0.944444 199
## 5                max precision  0.992906    1.000000   0
## 6                   max recall  0.026268    1.000000 363
## 7              max specificity  0.992906    1.000000   0
## 8             max absolute_mcc  0.339212    0.736182 207
## 9   max min_per_class_accuracy  0.099521    0.915104 294
## 10 max mean_per_class_accuracy  0.085198    0.917104 301
## 11                     max tns  0.992906 8799.000000   0
## 12                     max fns  0.992906 1185.000000   0
## 13                     max fps  0.015984 8799.000000 399
## 14                     max tps  0.026268 1191.000000 363
## 15                     max tnr  0.992906    1.000000   0
## 16                     max fnr  0.992906    0.994962   0
## 17                     max fpr  0.015984    1.000000 399
## 18                     max tpr  0.026268    1.000000 363
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on validation data. **
## 
## MSE:  0.05196215
## RMSE:  0.2279521
## LogLoss:  0.180089
## Mean Per-Class Error:  0.1281524
## AUC:  0.9551248
## AUCPR:  0.7584651
## Gini:  0.9102496
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          No Yes    Error       Rate
## No     2914 186 0.060000  =186/3100
## Yes      85 348 0.196305    =85/433
## Totals 2999 534 0.076705  =271/3533
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.204854    0.719752 232
## 2                       max f2  0.085253    0.795409 295
## 3                 max f0point5  0.682347    0.736592 107
## 4                 max accuracy  0.406547    0.930088 177
## 5                max precision  0.990655    1.000000   0
## 6                   max recall  0.019942    1.000000 382
## 7              max specificity  0.990655    1.000000   0
## 8             max absolute_mcc  0.204854    0.680865 232
## 9   max min_per_class_accuracy  0.094962    0.894194 289
## 10 max mean_per_class_accuracy  0.085253    0.898701 295
## 11                     max tns  0.990655 3100.000000   0
## 12                     max fns  0.990655  432.000000   0
## 13                     max fps  0.016080 3100.000000 399
## 14                     max tps  0.019942  433.000000 382
## 15                     max tnr  0.990655    1.000000   0
## 16                     max fnr  0.990655    0.997691   0
## 17                     max fpr  0.016080    1.000000 399
## 18                     max tpr  0.019942    1.000000 382
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.05328307
## RMSE:  0.2308313
## LogLoss:  0.1856351
## Mean Per-Class Error:  0.1471826
## AUC:  0.9483857
## AUCPR:  0.7403913
## Gini:  0.8967713
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          No  Yes    Error        Rate
## No     8988  491 0.051799   =491/9479
## Yes     310  968 0.242567   =310/1278
## Totals 9298 1459 0.074463  =801/10757
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.212150    0.707344 243
## 2                       max f2  0.072031    0.771386 310
## 3                 max f0point5  0.503071    0.722812 158
## 4                 max accuracy  0.503071    0.930092 158
## 5                max precision  0.992270    1.000000   0
## 6                   max recall  0.014454    1.000000 395
## 7              max specificity  0.992270    1.000000   0
## 8             max absolute_mcc  0.212150    0.666813 243
## 9   max min_per_class_accuracy  0.070643    0.884587 311
## 10 max mean_per_class_accuracy  0.068267    0.885621 313
## 11                     max tns  0.992270 9479.000000   0
## 12                     max fns  0.992270 1277.000000   0
## 13                     max fps  0.012123 9479.000000 399
## 14                     max tps  0.014454 1278.000000 395
## 15                     max tnr  0.992270    1.000000   0
## 16                     max fnr  0.992270    0.999218   0
## 17                     max fpr  0.012123    1.000000 399
## 18                     max tpr  0.014454    1.000000 395
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
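Rather than reading individual values off the printed summaries above, specific metrics can be pulled from the model object programmatically. A minimal sketch using H2O's accessor functions, assuming the stacked ensemble is bound to `stacked_ensemble_h2o` (as it is in Step 7):

```r
# Sketch: extract headline metrics directly instead of parsing the printout.
# Assumes `stacked_ensemble_h2o` holds the stacked ensemble model object.
h2o.auc(stacked_ensemble_h2o, train = TRUE, valid = TRUE, xval = TRUE)

# Confusion matrix at the F1-optimal threshold on the validation frame
h2o.confusionMatrix(stacked_ensemble_h2o, valid = TRUE)
```

This is handy when comparing many models in a loop, since the returned values can be collected into a tibble rather than scraped from console output. (Requires a running H2O cluster.)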

Step 6: Save the leader model

# Extracts an H2O model name by position so it can more easily be passed to h2o.getModel()
extract_h2o_model_name_by_position <- function(h2o_leaderboard, n = 1, verbose = TRUE) {
  
  model_name <- h2o_leaderboard %>%
    as_tibble() %>%   # as.tibble() is deprecated; use as_tibble()
    slice(n) %>%
    pull(model_id)
  
  if (verbose) message(model_name)
  
  return(model_name)
  
}

# Save the leader model by extracting from leaderboard
automl_models_h2o@leaderboard %>% 
  extract_h2o_model_name_by_position(1) %>% 
  h2o.getModel() %>%
  h2o.saveModel(path = "00_h2o_models/03/")

Step 7: Predict using leader model

# Loading top H2O model
stacked_ensemble_h2o <- h2o.loadModel("00_h2o_models/03/StackedEnsemble_AllModels_AutoML_20210105_210409")

predictions <- h2o.predict(stacked_ensemble_h2o, newdata = as.h2o(test_tbl))
predictions_tbl <- predictions %>% as_tibble()

predictions_tbl
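To judge the predictions at a glance, they can be set alongside the actual outcomes from the test set. A hedged sketch, assuming the target column in `test_tbl` is `went_on_backorder` (the label in the Product Backorders data set):

```r
# Sketch: pair predicted class and class probabilities with the actual labels.
# The column name `went_on_backorder` is assumed from the Product Backorders data.
test_tbl %>%
  select(went_on_backorder) %>%
  bind_cols(predictions_tbl)
```

Each row then shows the true label next to `predict`, `No`, and `Yes` (the predicted class and its class probabilities), which makes misclassified cases easy to spot before the formal evaluation in the next section.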

This concludes the Automated Machine Learning with H2O section in R!

Made with ♥
~Roy Ruiz~